Tree structured sparse coding on cubes 
Several recent works have discussed tree structured sparse coding, where $N$ data points
in $\mathbb{R}^d$, written as the $d \times N$ matrix $X$, are approximately decomposed into the product of matrices
$WZ$. Here $W$ is a $d \times K$ dictionary matrix, and $Z$ is a $K \times N$ matrix of coefficients. In tree structured
sparse coding, the rows of $Z$ correspond to nodes on a tree, and the columns of $Z$ are encouraged to
be nonzero on only a few branches of the tree; alternatively, the columns are constrained to lie on
at most a specified number of branches of the tree.

When viewed from a geometric perspective, this kind of decomposition is a "wavelet analysis" of
the data points in $X$ [10] [6] [11] [1]. As each row in $Z$ is associated to a column of $W$, the columns
of $W$ also take a tree structure. The decomposition corresponds to a multiscale clustering of the
data, where the scale of the clustering is given by the depth in the tree, and cluster membership
corresponds to activation of a row in $Z$. The root node rows of $Z$ correspond to the whole data
set, and the root node columns of $W$ are a best fit linear representation of $X$. The set of rows of $Z$
corresponding to each node specifies a cluster: a data point $x$ is in that cluster if it has active responses
in those rows. The set of columns of $W$ corresponding to a node specifies a linear correction to the
best fit subspace defined by the node's ancestors; the correction is valid on the corresponding cluster.

Here we discuss the analogous construction on the binary cube $\{-1, 1\}^d$. Linear best fit is replaced
by best fit subcubes.

1 The construction on the cube

1.1 Setup

We are given $N$ data points in $B^d = \{-1, 1\}^d$, written as the $d \times N$ binary matrix $X$. Our goal is to
decompose $X$ as a tree of subcubes and "subcube corrections". A $q$ dimensional subcube $C = C_{c,I^r}$
of $B^d$ is determined by a point $c \in B^d$, along with a set of $d - q$ restricted indices
$I^r = \{r_1, \ldots, r_{d-q}\}$. The cube $C_{c,I^r}$ consists of the points $b \in B^d$ such that
$b_{r_i} = c_{r_i}$ for all $r_i \in I^r$, that is

$$C_{c,I^r} = \{ b \in B^d \ \text{s.t.}\ b_{r_i} = c_{r_i} \ \forall r_i \in I^r \}.$$

The unrestricted indices $I^u = \{1, \ldots, d\} \setminus I^r$ can take on either value.
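
As an illustration of the definition above, subcube membership and the nearest point of a subcube can be sketched in Python. This is a minimal sketch rather than part of the original construction: the function names `in_subcube` and `project_to_subcube` are hypothetical, points are lists with entries in {-1, 1}, and indices are 0-based rather than 1-based as in the text.

```python
from typing import Iterable, List, Sequence


def in_subcube(b: Sequence[int], c: Sequence[int], restricted: Iterable[int]) -> bool:
    """Return True if b agrees with c on every restricted index."""
    return all(b[r] == c[r] for r in restricted)


def project_to_subcube(b: Sequence[int], c: Sequence[int],
                       restricted: Iterable[int]) -> List[int]:
    """Nearest point of the subcube to b in Hamming distance:
    copy c on the restricted indices and keep b on the free ones."""
    out = list(b)
    for r in restricted:
        out[r] = c[r]
    return out


# d = 4 with restricted indices {0, 1}, so q = 2 coordinates (2 and 3) are free.
c = [1, -1, 1, 1]
restricted = [0, 1]
print(in_subcube([1, -1, -1, 1], c, restricted))          # True
print(in_subcube([-1, -1, 1, 1], c, restricted))          # False
print(project_to_subcube([-1, -1, 1, 1], c, restricted))  # [1, -1, 1, 1]
```

Only the values of `c` on the restricted indices influence either function, which is why a subcube can equally be stored as a map from restricted index to value.
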
1.2 The construction 

Here I will describe a simple version of the construction where each node in the tree corresponds to
a subcube of the same dimension $q$, and a hard binary clustering is used at each stage. Suppose our
tree has depth $l$. Then the construction consists of

1. A tree structured clustering of $X$ into sets $X_{ij}$ at depth (scale) $i \in \{1, \ldots, l\}$ such that

   $$\bigcup_j X_{ij} = X,$$

2. and cluster representatives (that is, $(d - iq)$-dimensional subcubes)

   $$C_{c_{ij}, I^r_{ij}}$$

   such that the restricted sets have the property that if $ij$ is an ancestor of $i'j'$,

   $$I^r_{ij} \subset I^r_{i'j'},$$

   and

   $$(c_{ij})_s = (c_{i'j'})_s$$

   for all $s \in I^r_{ij}$.

Here each $c_{ij}$ is a vector in $B^d$; the complete set of $c_{ij}$ roughly corresponds to $W$ from before.
However, note that each $c_{ij}$ has precisely $iq$ entries that actually matter (those indexed by
$I^r_{ij}$); moreover, because of the nested equalities, the leaf nodes carry all the information on the
branch. This is not to say that the tree structure is not important or not used: it is, as the leaf
nodes have to share coordinates. However, once the full construction is specified, the leaf
representatives are all that is necessary to code a data point.
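
The nesting requirement on restricted sets and representatives can be checked mechanically. The sketch below assumes that nesting means the ancestor's restricted indices are contained in the descendant's and that the two representatives agree on them; the `Cube` type (a map from restricted index to value) and the name `is_nested` are hypothetical, chosen only for illustration.

```python
from typing import Dict

Cube = Dict[int, int]  # restricted index -> required value in {-1, 1}


def is_nested(ancestor: Cube, descendant: Cube) -> bool:
    """True if every restricted index of the ancestor is also restricted
    in the descendant, with the same value."""
    return all(s in descendant and descendant[s] == v
               for s, v in ancestor.items())


root = {0: 1}             # depth 1: one restricted coordinate (q = 1)
leaf_a = {0: 1, 1: 1}     # depth 2: inherits index 0, adds index 1
leaf_b = {0: 1, 1: -1}
broken = {0: -1, 1: 1}    # disagrees with the root on index 0

print(is_nested(root, leaf_a))   # True
print(is_nested(root, leaf_b))   # True
print(is_nested(root, broken))   # False
```

Because each leaf map already contains the entries of all its ancestors, the two leaves share the coordinate fixed at the root, which is the sense in which leaf nodes "share coordinates".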

1.3 Algorithms 

We can build the partitions and representatives starting from the root and descending down the tree
as follows: first, find the best fit $(d - q)$-dimensional subcube for the whole data set. This is given
by a coordinate-wise mode; the free coordinates are the ones with the largest average discrepancy
from their modes. Remove the $q$ fixed coordinates from consideration. Cluster the reduced
($(d - q)$-dimensional) data using $K$-means with $K = 2$; on each cluster find the best fit
$((d - q) - q)$-dimensional cube. Continue to the leaves.
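
The top-down procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: data points are stored as rows rather than columns, the 2-means step uses a deterministic farthest-point initialization instead of a randomized one, and all names (`best_fit_subcube`, `two_means`, `build_tree`, the node dictionary keys) are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Point = Sequence[int]   # entries in {-1, 1}
Cube = Dict[int, int]   # restricted index -> fixed value


def best_fit_subcube(points: List[Point], active: List[int], q: int) -> Cube:
    """Fix the q active coordinates that disagree least with their mode."""
    n = len(points)
    stats = []
    for s in active:
        total = sum(p[s] for p in points)
        mode = 1 if total >= 0 else -1
        mismatches = (n - abs(total)) // 2  # points that differ from the mode
        stats.append((mismatches, s, mode))
    stats.sort()
    return {s: mode for _, s, mode in stats[:q]}


def two_means(points: List[Point], active: List[int], iters: int = 10) -> List[int]:
    """Plain 2-means on the active coordinates (farthest-point initialization)."""
    def dist2(p, center):
        return sum((p[s] - center[s]) ** 2 for s in active)

    first = {s: float(points[0][s]) for s in active}
    far = max(points, key=lambda p: dist2(p, first))
    centers = [first, {s: float(far[s]) for s in active}]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            group = [p for p, lab in zip(points, labels) if lab == k]
            if group:  # keep the old center if a cluster becomes empty
                centers[k] = {s: sum(p[s] for p in group) / len(group)
                              for s in active}
    return labels


def build_tree(X: List[Point], members: List[int], active: List[int],
               q: int, levels: int, inherited: Optional[Cube] = None) -> dict:
    """Fit a subcube, drop its fixed coordinates, split in two, recurse."""
    pts = [X[i] for i in members]
    new_fixed = best_fit_subcube(pts, active, q)
    fixed = {**(inherited or {}), **new_fixed}   # children inherit ancestors' entries
    rest = [s for s in active if s not in new_fixed]
    node = {"fixed": fixed, "members": list(members), "children": []}
    if levels > 1 and len(members) >= 2 and len(rest) >= q:
        labels = two_means(pts, rest)
        for k in (0, 1):
            child = [i for i, lab in zip(members, labels) if lab == k]
            if child:
                node["children"].append(
                    build_tree(X, child, rest, q, levels - 1, fixed))
    return node


X = [[1, 1, 1, 1], [1, 1, 1, -1], [1, -1, -1, 1], [1, -1, -1, -1]]
tree = build_tree(X, members=[0, 1, 2, 3], active=[0, 1, 2, 3], q=1, levels=2)
print(tree["fixed"])                                # {0: 1}
print([ch["members"] for ch in tree["children"]])   # [[0, 1], [2, 3]]
print([ch["fixed"] for ch in tree["children"]])     # [{0: 1, 1: 1}, {0: 1, 1: -1}]
```

In this toy example coordinate 0 is constant and is fixed at the root; the split then separates the points by coordinates 1 and 2, and each child fixes one further coordinate while inheriting the root's.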

1.3.1 Refinement 

The terms $C_{c_{ij}, I^r_{ij}}$ and $X_{ij}$ can be updated with a Lloyd type alternation. With all of the $X_{ij}$ fixed,
loop through each $C$ from the root of the tree, finding the best subcubes at each scale for the current
partition. Now update the partition so that each $x$ is sent to its best fit leaf cube.
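
The partition update in this alternation can be sketched as below. The sketch assumes that "best fit" means the smallest number of disagreements with the leaf on its restricted coordinates, and that ties go to the leaf with the lowest index; the names `distance_to_cube`, `assign_to_leaves` and `update_partition` are hypothetical. The subcube refit for a fixed partition is not shown here.

```python
from typing import Dict, List, Sequence

Cube = Dict[int, int]  # restricted index -> fixed value


def distance_to_cube(x: Sequence[int], cube: Cube) -> int:
    """Hamming distance from x to the subcube: the number of restricted
    coordinates on which x disagrees with the representative."""
    return sum(1 for s, v in cube.items() if x[s] != v)


def assign_to_leaves(X: List[Sequence[int]], leaves: List[Cube]) -> List[int]:
    """Index of the best fit leaf cube for each data point."""
    return [min(range(len(leaves)), key=lambda k: distance_to_cube(x, leaves[k]))
            for x in X]


def update_partition(X: List[Sequence[int]], leaves: List[Cube]) -> List[List[int]]:
    """New leaf-level partition: member indices of each leaf."""
    parts: List[List[int]] = [[] for _ in leaves]
    for i, k in enumerate(assign_to_leaves(X, leaves)):
        parts[k].append(i)
    return parts


leaves = [{0: 1, 1: 1}, {0: 1, 1: -1}]
X = [[1, 1, -1, 1], [-1, -1, 1, 1], [1, -1, 1, -1]]
print(assign_to_leaves(X, leaves))   # [0, 1, 1]
print(update_partition(X, leaves))   # [[0], [1, 2]]
```

Since each leaf carries the entries of its ancestors, comparing against leaves alone is enough to place a point on a whole branch.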

1.3.2 Adaptive q, l, etc.

In [1], one of the important points is that many of the model parameters, including $q$, $l$, and
the number of clusters, could be determined in a principled way. While it is possible that some of
their analysis may carry over to this setting, this has not yet been done. However, instead of fixing $q$, we can
fix a percentage of the energy to be kept at each level, and choose the number of free coordinates
accordingly.
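
One way to read the energy criterion is sketched below. It rests on an assumption not spelled out in the text: the "energy" of a coordinate is taken to be the number of points that disagree with its mode, and coordinates are left free, largest discrepancy first, until the free ones account for the requested fraction of the total. The name `choose_free_coordinates` and the parameter `keep_fraction` are hypothetical.

```python
from typing import List, Sequence


def choose_free_coordinates(points: List[Sequence[int]], active: List[int],
                            keep_fraction: float) -> List[int]:
    """Pick free coordinates, largest discrepancy first, until they carry
    at least keep_fraction of the total discrepancy from the modes."""
    n = len(points)
    # For +/-1 data, the number of points differing from the mode of
    # coordinate s is (n - |sum of the column|) / 2.
    scored = sorted(((n - abs(sum(p[s] for p in points))) // 2, s)
                    for s in active)
    scored.reverse()
    total = sum(m for m, _ in scored)
    free: List[int] = []
    kept = 0
    for m, s in scored:
        if total == 0 or kept >= keep_fraction * total:
            break
        free.append(s)
        kept += m
    return sorted(free)


points = [[1, 1, 1, 1], [1, 1, 1, -1], [1, 1, 1, 1], [1, 1, -1, 1], [1, -1, -1, 1]]
# Discrepancies per coordinate: 0, 1, 2, 1 (total 4).
print(choose_free_coordinates(points, [0, 1, 2, 3], 0.5))    # [2]
print(choose_free_coordinates(points, [0, 1, 2, 3], 0.75))   # [2, 3]
```

The remaining active coordinates are then the ones fixed at that level, so the number of fixed coordinates varies from node to node instead of being a constant $q$.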

2 Experiments 

We binarize the MNIST training data by thresholding to obtain $X$. Here $d = 28^2$ and
$N = 60000$. We replace 70% of the entries in $X$ with noise sampled uniformly from $\{-1, 1\}$, and
train a tree structured cube dictionary with $q = 80$ and depth $l = 9$. The subdivision scheme used to
generate the multiscale clustering is 2-means initialized via randomized farthest insertion [2]; this
means we can cycle spin over the dictionaries [5] to get many different reconstructions to average
over. In this experiment the reconstruction was performed 50 times for the noise realization. The
results are visualized below.

Figure 1: Results of denoising using the tree structured coding. The top left image is the first 64
binarized MNIST digits after replacing 70% of the data matrix with uniform noise. The top right
image is recovered using a binary tree of depth $l = 9$ and $q = 90$, and 100 cycle spins; the output is
non-binary because the final result is the average over the random clustering initializations (with
the same noise realization). The bottom left image is recovered using robust PCA [4], for comparison.
The bottom right is the true binary data.
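
The corruption and averaging steps of the experiment can be sketched on synthetic data as follows. This is an illustrative sketch only: it does not load MNIST or train a dictionary, the names `corrupt` and `average_reconstructions` are hypothetical, and it assumes an exact fraction of entries is replaced by draws that are uniform on {-1, 1} (so roughly half of the replaced entries keep their original value).

```python
import random
from typing import List


def corrupt(X: List[List[int]], fraction: float, rng: random.Random) -> List[List[int]]:
    """Replace the given fraction of entries with uniform draws from {-1, 1}."""
    rows, cols = len(X), len(X[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    noisy = [list(row) for row in X]   # copy, so X itself is left untouched
    for i, j in rng.sample(cells, round(fraction * len(cells))):
        noisy[i][j] = rng.choice((-1, 1))
    return noisy


def average_reconstructions(recons: List[List[List[float]]]) -> List[List[float]]:
    """Entrywise mean of several reconstructions of the same noisy data;
    the result is in general non-binary."""
    n = len(recons)
    rows, cols = len(recons[0]), len(recons[0][0])
    return [[sum(r[i][j] for r in recons) / n for j in range(cols)]
            for i in range(rows)]


clean = [[1] * 10 for _ in range(10)]
noisy = corrupt(clean, 0.7, random.Random(0))
flipped = sum(v == -1 for row in noisy for v in row)
print(flipped <= 70)                                          # True
print(average_reconstructions([[[1, -1]], [[1, 1]]]))         # [[1.0, 0.0]]
```

In the experiment each reconstruction would come from a dictionary trained with a different random clustering initialization on the same noisy matrix, and the averaged output is what is displayed.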